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Abstract 



The ultimate goal of regression analysis is to obtain information about the conditional 
distribution of a response given a set of explanatory variables. This goal is, however, sel- 
dom achieved because most established regression models only estimate the conditional 
mean as a function of the explanatory variables and assume that higher moments are not 
affected by the regressors. The underlying reason for such a restriction is the assumption 
of additivity of signal and noise. We propose to relax this common assumption in the 
framework of transformation models. The novel class of scmiparametric regression mod- 
els proposed herein allows transformation functions to depend on explanatory variables. 
These transformation functions are estimated by regularised optimisation of scoring rules 
for probabilistic forecasts, e.g. the continuous ranked probability score. The corresponding 
estimated conditional distribution functions are consistent. Conditional transformation 
models are potentially useful for describing possible heteroscedasticity, comparing spa- 
tially varying distributions, identifying extreme events, deriving prediction intervals and 
selecting variables beyond mean regression effects. An empirical investigation based on 
a heteroscedastic varying coefficient simulation model demonstrates that semiparametric 
estimation of conditional distribution functions can be more beneficial than kernel-based 
non-parametric approaches or parametric generalised additive models for location, scale 
and shape. 

Keywords: boosting, conditional distribution function, conditional quantile function, contin- 
uous ranked probability score, prediction intervals, structured additive regression. 



One of the famous "Top ten reasons to become a statistician" is that statisticians are "mean 
lovers" (Friedman, Friedman, and Amoo 2002), referring of course to our obsession with means. 
Whenever a distribution is too complex to think or expound upon, we focus on the mean as a 
single real number describing the centre of the distribution and block out other characteristics 
such as variance, skewness and kurtosis. Our willingness to simplify distributions this way 
is most apparent when we deal with many distributions at a time, as in a regression setting 
where we describe the conditional distribution ^y\x=x °f a response Y £ R given different 

*This is an extended version of Hothorn, Kneib, and Biihlmann (2013). Please cite this reference unless 
you're referring to the additional applications presented in this document, 
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configurations of explanatory variables X = x G x- Many regression models focus on the 
conditional mean E(Y|X = x) and treat higher moments of the conditional distribution 
^'y\x=x as fixed or nuisance parameters that must not depend on the explanatory variables. 
As a consequence, model inference crucially relies on assumptions such as homoscedasticity 
and symmetry. Information on the scale of the response, for example prediction intervals, 
derived from such models also depends on these assumptions. Here, we propose a new class 
of conditional transformation models that allow the conditional distribution function ¥{Y < 
v\X = x) to be estimated directly and semiparametrically under rather weak assumptions. 
Before we introduce this class of models in Section 2, we will attempt to set a scene of 
contemporary regression in the light of Gilchrist (2008) and place the major players onto this 
stage. 

Let Y x = (Y\X = x) ~ ^y\x=x denote the conditional distribution of response Y given 
explanatory variables X = x. We assume that ^y\x=x is dominated by some measure fi and 
has the conditional distribution function ¥(Y < v\X = x). A regression model describes the 
distribution ^y\x=xi or certain characteristics of it, as a function of the explanatory variables 
x. We estimate such models based on samples of pairs of random variables (Y, X) from 
the joint distribution Py 5 x- It is convenient to assume that a regression model consists of 
signal and noise, i.e. a deterministic part and an error term. In the following, we denote the 
error term by Q(U), where U ~ £/[0, 1] is a uniform random variable independent of X and 
Q : R — > R is the quantile function of an absolutely continuous distribution. 

Apart from non-parametric kernel estimators of the conditional distribution function (Hall, 
Wolff, and Yao 1999, Hall and Miiller 2003, Li and Racine 2008), there are two common ways 
to model the influence of the explanatory variables x on the response Y x : 

Y x = r(Q(U)\x) "mean or quantile regression models" and (1) 
h(Y x \x) = Q(U) "transformation models". (2) 

For each x 6 the regression function r(-\x) : R — > R transforms the error term Q(U) in a 
monotone increasing way. The inverse regression function h(-\x) = r~ l (-\x) : R — > R is also 
monotone increasing. Because h transforms the response, it is known as a transformation 
function, and models in the form of (2) are called transformation models. 

A major assumption underlying almost all regression models of class (1) is that the regression 
function r is the sum of the deterministic part r x : x — > R, which depends on the explanatory 
variables, and the error term: 

r(Q{U)\x) = r x {x) + Q(U). 

When K(Q(U)) = 0, we get r x (x) = K(Y\X = x), e.g. linear or additive models depending on 
the functional form of r x . Model inference is commonly based on the normal error assumption, 
i.e. Q(U) = cr$ _1 (?7), where a > is a scale parameter and <&~ l (U) ~ A/"(0,1). Linear 
heteroscedastic regression models (Carroll and Ruppert 1982) allow describing the variance as 
a function of the explanatory variables Q(U) = a(x)Q(U), where cr(x) is a (usually log-linear) 
function of x and Q is the quantile function of a symmetric distribution with K(Q(U)) = 0. In 
time series analysis, GARCH models (Bollerslev 1986) share this idea. A novel semiparametric 
approach is extended generalised additive models, where additive functions of the explanatory 
variables describe location, scale and shape (GAMLSS) of a certain parametric conditional 
distribution of the response given the explanatory variables (Rigby and Stasinopoulos 2005). 
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If the assumption of a certain parametric form of the conditional distribution is questionable, 
r x describes the r quantile of Y x when the quantile function Q is such that Q(t) = for some 
r € (0, 1). This leads us to quantile regression (Koenker 2005), which is according to Stigler 
(2010) the "best approach to robust methods in higher dimensional linear model problems". 
Estimating the complete conditional quantile function is less straightforward since we have 
to fit separate models for a grid of probabilities r, and the resulting regression quantiles may 
cross. Solutions to this problem can be obtained by combining all quantile fits in one joint 
model based on, for example, location-scale models (He 1997) or quantile sheets (Schnabel 
and Eilers 2012), or by monotonising the estimated quantile curves using non-decreasing 
rearrangements (Dette and Volgushev 2008). 

For transformation models (2), additivity is assumed on the scale of the inverse regression 
function h: 

h(Y x \x) = h Y {Y x ) + h x (x). 

When E(Q(U)) = 0, we get -h x (x) = E(h Y (Y x )) = E(hy(Y)\X = x). The monotone 
transformation function hy '■ K — >■ M does not depend on x and might be known in advance 
(Box-Cox transformation models with fixed parameters, accelerated failure time models) or 
is commonly treated as a nuisance parameter (Cox model, proportional odds model). One 
is usually interested in estimating the function h x : \ ~ * which describes the conditional 
mean of the transformed response hy{Y x ). The class of transformation models is rich and 
very actively researched, most prominently in literature on the analysis of survival data. 
For example, the linear Weibull accelerated failure time model assumes a log transforma- 
tion hy(Y x ) = log(Ya;), a linear function for the conditional mean of the log-transformed 
response h x (x) = x T a, and a Weibull-distributed error term Q(U) = o"Qweibuii(^0- For 
the Cox additive model, hy(Y x ) = log(A(Ya;)) is based on the unspecified integrated baseline 
hazard function A, h x {x) = J2j=i hx,j(x) is the sum of J smooth terms depending on the 
explanatory variables and Q(U) = — log(— log([/)) is the quantile function of the extreme 
value distribution. The proportional odds model has hy(Y x ) = log(r(Ya;)), with T being an 
unknown monotone increasing function, and Q(U) = log([//(l — U)) is the quantile function 
of the logistic distribution. Doksum and Gasko (1990) discussed the flexibility of this class of 
models, and Cheng, Wei, and Ying (1995) introduced a generic algorithm for linear transfor- 
mation model estimation, that is, for models with h x (x) = x a, treating the transformation 
function hy as a nuisance. 

In recent years, transformation models have been extended in two directions. In the first 
direction, more flexible forms for the conditional mean function h x have been introduced, 
e.g. the partially linear transformation model h x (x) = x T (0, a) T + ^-smooth (^i) (where /i S mooth 
is a smooth function of the first variable xy, Lu and Zhang 2010), the varying coefficient 
model h x (x) = a; T (0,0, q) t + h sra0 oth(xi)x2 (Chen and Tong 2010), random effects models 
(Zeng, Lin, and Yin 2005), and various approaches to additive transformation and accelerated 
failure time models, such as the boosting approaches by Lu and Li (2008) and Schmid and 
Hothorn (2008). In the second direction, a number of authors have considered algorithms that 
estimate hy and (partially) linear functions h x simultaneously, usually by a spline expansion 
of hy (e.g. Shen 1998, Cheng and Wang 2011), as an alternative to the common practise of 
estimating hy post-hoc by some non-parametric procedure such as the Breslow estimator. 

Although the transformation function hy is typically treated as an infinite dimensional nui- 
sance parameter, it is important to note that hy contains information about higher moments 
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of Y x , most importantly variance and skewness. Simultaneous estimation of hy and h x is 
therefore extremely attractive because we can obtain information about the mean and higher 
moments of the transformed response at the same time. However, owing to the decomposition 
of the regression function r or the transformation function h into both a deterministic part de- 
pending on the explanatory variables (r x or h x ) and a random part depending on the response 
(hy) or error term (Q(U)), higher moments of the conditional distribution of Y given X = x 
must not depend on the explanatory variables in mean regression and transformation models. 
As a consequence, the corresponding models cannot capture heteroscedasticity or skewness 
that is induced by certain configurations of the explanatory variables. Therefore, we cannot 
detect these potentially interesting patterns, and our models will perform poorly when prob- 
ability forecasts, prediction intervals or other functionals of the conditional distribution are 
of special interest. 

Recently, Wu, Tian, and Yu (2010) proposed a novel transformation model for longitudi- 
nal data that partially addresses this issue. For responses and explanatory variables X{t) 
observed at time t, the model assumes 

h(Y x \t, x) = h Y (Y x \t) + x(t) T a(t). 

Here, the transformation hy is conditional on time, and higher moments may vary with time. 
However, since hy does not depend on the explanatory variables x, these higher moments may 
not vary with one or more of the explanatory variables. In the context of longitudinal data 
with functional explanatory variables, Chen and Muller (2012) consider a similar model, where 
the regression coefficients for functional principle components may depend on time t and the 
response Y x . Our contribution is a class of transformation models where the transformation 
function is conditional on the explanatory variables in the sense that the transformation 
function, and therefore higher moments of the conditional distribution of the response, may 
depend on potentially all explanatory variables. As a consequence, the models suggested here 
are able to deal with heteroscedasticity and skewness that can be regressed on the explanatory 
variables, and we will show that reliable estimates of the complete conditional distribution 
function and functionals thereof can be obtained. 

We will introduce the conditional transformation models (Section 2), discuss the underly- 
ing model assumptions, and embed the estimation problem in the empirical risk minimisation 
framework (Section 3). For the sake of simplicity, we restrict ourselves to continuous responses 
Y that have been observed without censoring. Similar to other transformation models, condi- 
tional transformation models are flexible enough to deal with discrete responses and survival 
times, as will be discussed in later sections. We present a computationally efficient algorithm 
for fitting the models in Section 4. We study the asymptotic properties of the estimated con- 
ditional distribution functions in Section 5. The practical benefits of modelling the influence 
of explanatory variables on the variance and higher moments of the response' distribution 
are demonstrated in Section 6 with a special emphasis on distributional characteristics of 
childhood nutrition in India and on prediction intervals for birth weights of small foetus. Fi- 
nally, we use a heteroscedastic varying coefficient simulation model to evaluate the empirical 
performance of the proposed algorithm and compare the quality of conditional distribution 
functions estimated by a conditional transformation model and established parametric and 
non-parametric procedures in Section 7. 
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2. Conditional Transformation Models 

An attractive feature of transformation models is their close connection to the conditional 
distribution function. With the transformation function h(Y x \x) = Q(U), one can evaluate 
the conditional distribution function of response Y given the explanatory variables x via 

P(Y < v\X = x) = F(h(Y\x) < h(v\x)) = F(h(v\x)) 

with absolute continuous distribution function F = Q _1 . For additive transformation func- 
tions h = hy + h x , the conditional distribution function reads F(h(v\x)) = F(liY(v) + h x (x)), 
i.e. the distribution is evaluated for a transformed and shifted version of Y. Higher moments 
only depend on the transformation hy and thus cannot be influenced by the explanatory 
variables. Consequently, one has to avoid the additivity in the model h = hy + h x to al- 
low the explanatory variables to impact also higher moments. We therefore suggest a novel 
transformation model based on an alternative additive decomposition of the transformation 
function h into J partial transformation functions for all x € x : 

J 

h(v\x) = hj(v \x), (3) 

3=1 

where h{v\x) is the monotone transformation function of v. In this model, the transformation 
function h(Y x \x) and the partial transformation functions hj(-\x) : M — > M. are conditional on 
x in the sense that not only the mean of Y x depends on the explanatory variables. For this 
reason, we coin models of the form (3) Conditional Transformation Models (CTMs). Clearly, 
model (3) imposes an assumption, namely additivity of the conditional distribution function 
on the scale of the quantile function Q: 

J 

Q(P(Y < v\X = a:)) = ^h j {y\x). 

3=1 

It should be noted that here we assume additivity of the transformation function h and not 
additivity on the scale of the regression function r as it is common for additive mean or quantile 
regression models (1). Furthermore, monotonicity of hj is sufficient but not necessary for h 
being monotone. Of course, we have to make further assumptions on hj to obtain reasonable 
models, but these assumptions are problem specific, and we will therefore postpone these issues 
until Section 6. To ensure identifiability, we assume without loss of generality that the partial 
transformation functions are centred around zero KyKxhj(Y\X) = for all j = 1, . . . , J for 
non-systematic error terms (E(Q(t/)) = 0). 

3. Estimating Conditional Transformation Models 

The estimation of conditional distribution functions can be reformulated as a mean regression 
problem since ¥(Y < v\X = x) = E(/(Y < v)\X = x) for the binary event Y < v; this 
connection is widely used (e.g. by Hall and Miiller 2003, Chen and Miiller 2012). Similar 
to the approach of fitting multiple quantile regression models to obtain an estimate of the 
conditional quantile function, one could estimate the models E(/(Y < v)\X = x) for a grid 
of v values separately. However, we instead suggest estimating conditional transformation 
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models by the application of an integrated loss function that allows the whole conditional 
distribution function to be obtained in one step. 

Let p denote a function of measuring the loss of the probability F(h(v\X)) for the binary 
event Y < v. One candidate loss function is 



the negative log-likelihood of the binomial model (Y < v\X = x) ~ B(l, F(h(v\x))) for the 
binary event Y < v with link function Q = F . Alternatively, one may consider the squared 
or absolute error losses 



The squared error loss p sqe is also known as the Brier score, and the absolute loss p a \, e has 
been applied for assessing survival probabilities in the Cox model by Schemper and Henderson 
(2000). We define the loss function £ for estimating conditional transformation models as 
integrated loss p with respect to a measure p dominating the conditional distribution P(Y < 
v\X = x): 



In the context of scoring rules, the loss I based on p sqe is known as the continuous ranked 
probability score (CPRS) or integrated Brier score and is a proper scoring rule for assessing 
the quality of probabilistic or distributional forecasts (see Gneiting and Raftery 2007, for an 
overview). It seems natural to apply these scores as loss functions for model estimation, but 
we are only aware of the work of Gneiting, Raftery, Westveld III, and Goldman (2005), who 
directly optimise the CPRS for estimating Gaussian predictive probability density functions 
for continuous weather variables. In the context of non-parametric or semiparametric esti- 
mation of conditional distribution functions, minimisation of the empirical analog of the risk 
function 



for estimating conditional distribution functions has not yet been considered. 

Model estimation based on the risk Ey;x^((Y, X), h) is reasonable because the corresponding 
optimisation problem is convex and attains its minimum for the true conditional transforma- 
tion function h. We summarise these facts in the following lemma, whose proof in given in 
the Appendix. 

Lemma 1. The risk Ey;x^((Y, X), h) is convex in h for convex losses p in h. The population 
minimiser ofEY,x@((Y,X),h) for p = p^m and p = p sqe is h(v\x) = Q(¥(Y < v\X = x)). 
For p = p a be, the minimiser is 



Pbm((Y<v,X),h(v\X)) 



[I{Y < v)log{F(h(v\X))} + 

{1 - I(Y < v)} log{l - F(h(v\X))}] > 



Psqe ((Y <v,X),h(v\X)) 

Pabe(CT <V,X),h(v\X)) 



-\I(Y<v)-F(h(v\X))\ 2 >0 
|J(Y < v) -F(h(v\X))\ > 0. 





p((y < v,x),h(v\x))dp(v)dF Y ,x(y,x) > 



h(v\x) = 



— oo : 



oo : 



P(Y < v\X = x)< 0.5 
P(Y < v\X =x)> 0.5. 
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The corresponding empirical risk function denned by the data is 

E Y;X t{(Y,X)J) = f j p((y<v,x),h(v\x))dp(v)dF YX (y,x) > 0. 

Based on an i.i.d. random sample (Yi, Xi) ~ lPy,x^ = 1, ■ ■ ■ ,N of N observations from the 
joint distribution of response and explanatory variables, we define Py t x as the distribution 
putting mass Wi > on observation i (wi = N^ 1 for the empirical distribution). For com- 
putational convenience, we also approximate the measure p by the discrete uniform measure 
p, which puts mass n _1 on each element of the equi-distant grid v\ < • ■ • < v n € M over the 
response space. The number of grid points n has to be sufficiently large to closely approximate 
the integral. The empirical risk is then 

N n 

E Y ,xi((Y,X),h) = J2w in - l J2p(( Y i<^ x i),h(^\Xi)) (4) 

i=l i=l 
N n 

= n" 1 y~^y~^ Wjp((Yj < v l ,X i ),h(v l \X i )). 

i=l i=l 

This risk is the weighted empirical risk for loss function p evaluated at the observations (Yi < 
v t , Xi) for i = 1, . . . , N and i = 1, . . . , n. Consequently, we can apply algorithms for fitting 
generalised additive models to the binary responses Yi < v % under loss p for estimating model 
(3). Although this seems to be rather straightforward, there are two issues to consider. First, 
simply expanding the observations over the grid v\ < • • • < v n increases the computational 
complexity by n, which, even for moderately large sample sizes N, renders computing and 
storage rather burdensome. Second, unconstrained minimisation of the empirical risk, i.e. no 
smoothness of h in its first argument and h being independent of the conditioning x, leads 
to estimates F(h(v\x)) = P(Y < v) = N^ 1 YliLi I(Yi — v )i *- e - the empirical cumulative 
distribution function of Y for pbin and p sqe with Wi = N~ l . For /5 a bej the empirical risk is 
minimised by F(h(v\x)) = for all v with P(Y < v) < 0.5 and otherwise by F{h{v\x)) = 1. 
Therefore, careful regularisation is absolutely necessary to obtain reasonable models that 
lead to smooth conditional distribution functions (i.e. smoothing in the Y-direction) and that 
are similar for similar configurations of the explanatory variables {i.e. smoothing in the X- 
direction). Instead of adding a direct penalisation term to the empirical risk, we propose in 
the next section a boosting algorithm for empirical risk minimisation that indirectly controls 
the functional form and complexity of the estimate h. 



4. Boosting Conditional Transformation Models 

We propose to fit conditional transformation models (3) by a variant of component-wise 
boosting for minimising the empirical risk (4) with penalisation. In this class of algorithms, 
regularisation is achieved indirectly via the application of penalised base-learners and the 
complexity of the whole model is controlled by the number of boosting iterations. We refer 
the reader to Biihlmann and Hothorn (2007) for a detailed introduction to component-wise 
boosting. 

For conditional transformation models, we parameterise the partial transformation functions 
for all j = 1, . . . , J as 

hj(v\x) = (bj(x) T 8) 6 (^) T ) 7j e M, jj e R K J K °, (5) 
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where bj(x) <S> bo(v) denotes the tensor product of two sets of basis functions bj : x ~^ R j 
and b : R -> Here, 60 is a basis along the grid of v values that determines the functional 
form of the response transformation. The basis bj defines how this transformation may vary 
with certain aspects of the explanatory variables. The tensor product may be interpreted as a 
generalised interaction effect (further illustrated in Section 6). For each partial transformation 
function hj, we typically want to obtain an estimate that is smooth in its first argument v 
and smooth in the conditioning variable x. Therefore, the bases are supplemented with 
appropriate, pre-specified penalty matrices Pj G M. J x 1 and Pq G ~R k ° xK °, inducing the 
penalty matrix Pqj = (Ao-Pj<8)li<' +AjlK.<8>-Po) with smoothing parameters Ao > and Xj > 
for the tensor product basis. The base- learners corresponding to the partial transformation 
functions fitted to the negative gradients in each iteration of the boosting algorithm are then 
Ridge-type linear models with penalty matrix Poj. In more detail, we apply the following 
algorithm for fitting conditional transformation models with partial transformation functions 
of the form (5): 



Algorithm: Boosting for Conditional Transformation Models 



(Init) Initialise the parameters 7^ =0 for j = 1, . . . , J, the step-size v G (0, 1) and the 
smoothing parameters Xj, j = 0, . . . , J. Define the grid v\ < Yn\ < • • ■ < Y(n) — v n- 
Set m := 0. 



(Gradient) 

Compute the negative gradient: 



d_ 
dh 



h=ti: 



with h^ ] = e/ =1 (&i(**) T ® b (v t ) T ) i [ ; 

Fit the base- learners for j = 



^. = argmin^^^j^- (6 j (X i ) T ®foo(^) T )/3| +/3 T P 0j /3 (6) 

with penalty matrix Poj- 
Select the best base-learner: 



N n 
j=l,-,J j =1 l=1 



iv n r 

argmin^^^i ^U i% - (bj(Xi) T <g>6 (^) T ) ^j} 



(Update) the parameters 7^T' +1 ^ = 7^+^^* and keep all other parameters fixed, i.e. 7^ m+1 ^ 

[ml • / -j. 

Tj ,3 t J ■ 
Iterate (Gradient) and (Update) 
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(Stop) if m = M. Output the final model 



F(Y < v\X = x) = F (ti M ^ (v\x)) = f\J2 (bj(x) T ® 6 (^) T ) 7 J 




as a function of arbitrary v E M and x £ x- 

Before we investigate the asymptotic properties of the resulting estimates, we will discuss 
some details of this generic algorithm in the following. 

Model Specification. The basis functions &o an d bj determine the form of the fitted 
model, and their choice is problem specific. In the simplest situation, in which the conditional 
distribution of Y given only one numeric explanatory variable x\ shall be estimated, one could 
use the basis functions bo(v) = (l,f) and b\(x) = (l,xi) T . The corresponding base-learner 
is then defined by the linear function 



For each x\, the transformation is linear in v with intercept 71 + 73X1 and slope 72 + 74X1, 
i.e. not only the mean may depend on x\ but also the variance. Restricting, for example, 
b$(v) to be constant, i.e. b$(y) = 1, allows the effects of explanatory variables to be restricted 
to the mean alone. Assuming b\(x) = 1, on the other hand, yields a transformation function 
that is not affected by any explanatory variable. More flexible basis functions, e.g. i?-spline 
basis functions, allow also for higher moments to depend on the explanatory variables. We 
illustrate appropriate choices of basis functions in Section 6. 

Computational Complexity. For the estimation of base- learner parameters f3j in (6), 
it is not necessary to evaluate the Kronecker product (g) in (5) and to compute the nN x 
KqKj design matrix for the jih base-learner. The base-learners used here are a special form 
of multidimensional smooth linear array models (Currie, Durban, and Eilers 2006), where 
efficient algorithms for computing Ridge estimates (6) exist. The number of multiplications 
required for fitting the jth base-learner is approximately c 6 /(c 2 /N — 1), instead of N 2 c 4 for 
the simplest case with c = Kq = Kj and N = n (see Table 2 in Currie et al. 2006), and 
the required memory for storing the design matrices is of the order NKj + NKq, instead 
of NnKjKo. Note that only the gradient vector is of length Nn; all other objects can be 
stored in vectors or matrices growing with either iV or n, and an explicit expansion of the 
observations (Yi < v t , Xi) for i = 1, . . . , N and 1 = 1, . . . , n is not necessary. 

Choice of Tuning Parameters. The number of boosting iterations M is the most impor- 
tant tuning parameter determined by resampling, e.g. by fe-fold cross-validation or bootstrap- 
ping. For the latter resampling scheme, the weights Wi in (4) are drawn from an iV-dimensional 
multinomial distribution with constant probability parameters pi = iV -1 ,? = 1, . . .,N. The 
out-of-bootstrap (OOB) empirical risk with weights u>P° B = I{wi = 0) is then used as a 
measure to assess the quality of the distributional forecasts for a varying number of boosting 
iterations M. It should be noted that the loss function used to fit the models is the same 
function that is used as a scoring rule to assess the quality of the probabilistic forecasts of 
the OOB observations. 



((l,£i) (g> (1,V))J 1 = (1,V,X 1 ,X 1 V)^ 1 . 
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The smoothing parameters A,-, j = 0, . . . , J in the penalty matrices are not tuned but rather 
defined such that the j'th base-learner has low degrees of freedom. For our computations, 
we simplified the penalty term to Poj = Xj(Pj <8> 1k + ^-K* ® Po), *-e- one parameter 
controls the smoothness in both directions. Following Hofner, Hothorn, Kneib, and Schmid 
(2011a), the parameters Xj were defined such that each base-learner has the same overall low 
degree of freedom. Note that the degree of freedom of the estimated partial transformation 
function adapts to the complexity inherent in the data via the number of boosting iterations 
M (Buhlmann and Yu 2003). Different smoothness in the two directions can be imposed by 
choosing different basis functions for &o and bj, e.g. a linear basis function for bj and i?-splines 
for &o- 

Other parameters, such as knots or degrees of basis functions or the number n of grid points 
the integrated loss function t is approximated with are not considered as tuning parameters. 
The resulting estimates are rather insensitive to their different choices. Also, we do not 
consider the distribution function F or the loss function p as tuning parameter but assume 
that these are part of the model specification. Different versions of F and p lead to different 
negative gradients; these are given in the Appendix. 

Monotonicity. The resulting estimate h\ M ^ (v\x) is not automatically monotone in its first 
argument. Monotonicity and smoothness in the Y-direction depend on each other, and too- 
complex estimates tend to suffer from non-monotonicity. Empirically, based on experiments 
reported in Sections 6 and 7, non-monotonicity is a problem in poorly-fitting models, due 
to either misspecification, overfitting, or a low signal-to-noise ratio. From our point of view, 
inspecting the model for non-monotonicity is helpful for model diagnostics and can be dealt 
with by reducing model complexity. Alternatively, there are three possible modifications to 
the algorithm that can be implemented to enforce monotonicity: (i) fit base-learners under 
monotonicity constraints in (6), e.g. by using the iterative re-penalisation suggested by Eilers 
(2005) and applied to boosting by Hofner, Muller, and Hothorn (2011c), (ii) check mono- 
tonicity for each base-learner and select the best among the monotone candidates only, or 
(iii) select the base-learner such that it is the best one among all candidates that lead to 
monotone updates in h^ m K None of these approaches had to be used for our empirical stud- 
ies, in which all resulting estimates were monotone for the appropriate number of boosting 
iterations M. 

Model Diagnostics and Overfitting. Another convenient feature of transformation mod- 
els is that, with the correct model h for absolute continuous random variables Y, the errors 
E{ = h(Yi\Xi),i = 1, ...,N are distributed according to F. Therefore, if the observed 
residuals £"] M ' = h^ M ^(Yi\Xi) are unlikely to come from distribution F, e.g. assessed using 
quantile-quantile plots or a Kolmogorov-Smirnov statistic, the model is likely to fit the data 
poorly. However, a good agreement between E^ M ^ and F does not necessarily mean that the 
explanatory variables describe the response well. A high correlation between the ranking of 
the residuals and the ranking of the responses Y\ , . . . , Yv means that the estimated conditional 
distribution is very close to the unconditional empirical distribution of the responses. In this 
case, either the model may overfit or the response may be independent of the explanatory 
variables. 

The fitted model may also be used to draw novel responses for given explanatory variables 
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using the model-based bootstrap via 

Yi = [v : Q(Ui) = hM(v\Xi)} , i = l,...,N 

where U\, . . . ,Un are i.i.d. uniform random variables. The stability of the model can now be 
investigated by refitting the model with observations (Yi, Xi),i = 1, . . . , N. 

5. Consistency of Boosted Conditional Transformation Models 

The boosting algorithm presented here is a variant of L2WCBoost (Buhlmann and Yu 2003) 
applied to dependent observations with more general base-learners. In this section, we will 
develop a consistency result for the squared error loss p sqe . For simplicity, we consider the 
case in which the procedure is used with F(h) = h as the identity function, i.e. the error term 
is uniformly distributed. Thus, we consider conditional transformation models of the form 

,/ 

F(Y < v\X =x) = h(v\x) = ^2 hj(v\xj), 

i=i 

where the partial transformation function hj is conditional on the jth explanatory variable 
in x = (xx, . . . ,xj) G x an d Ey (N^ 1 YliLi h(Y\Xi)) = 1/2. Our analysis is for the fixed 
design case with deterministic explanatory variables Xi or when conditioning on all X{S. A 
modification for the random design case could be pursued along arguments similar to those 
for L2Boosting as in Buhlmann (2006). 

As in Section 4, we use a basis expansion of h(v\x): 

J J k q,n Ki,n 

h Nn (v\x) = (bj( x j) T ® b o(v) T ) 7j = X) YI 7hkoM b o,k ( v ) b iM( x i)' 

j=l j=l k =l fci=l 

where for the sake of simplicity the number of basis functions Kin is equal for all Xj. 
Consider the (empirical) risk functions 

N n 

RnA h ) = (^r 1 E ^ V ^ - H^lXi)) 2 

1=1 1=1 

and 

TV n 

Rn,N,dh) = (nN)- 1 ^^E[(/(Y < v t ) - h^X,)) 2 }. 

i=l i=l 

Denote the projected parameter by 

7 07V = argmin.R ni jv iE (/iAr i7 ). (7) 

7 

We make the following assumptions: 
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(Al) The coefficient vector ~y N is sparse and satisfies 

Thereby, the dimensionality J = Jjv can grow with N. 
(A2) The basis functions satisfy: for some < C < oo, 

< C, ||6j,fci ||oo < C Vj, fc , h. 

(A3) 

AT n 

(niV)- 1 ^^^^!^,) 2 < D < oo Vn,iV 
i=i i=i 

Assumption (Al) is an £i-norm sparsity assumption, (A2) is a mild restriction since we are 
modeling I(Y < v), and (A3) requires that the signal strength does not diverge as n, N — > oo. 

Theorem 1. Assume (A1)-(A3). Then, for fixed n or for n = un — > oo (N — > oo), and for 
M = M N -»• oo (A -»• oo), Mjv = o(y/N/\og(J N K 0jN K 1<N )): 

N n 

(nN)- 1 E( /t -yM ^0 " ^7o,iv(^l^)) 2 = op(1) oo). 
i=i i=i 

A proof is given in the Appendix. 

Convergence of h^ 0N (v\x) to the true function h{y\x) involves approximation theory to 
achieve 

N n 

(nN)- 1 E Y,(th 0tK (v t \Xi) - h(v t \X l )) 2 = o(l) (n, N -> oo). (8) 
i=i i=i 

We typically would want to estimate the function h(v\x) well over the whole domain, e.g. [a v , b v ] x 
X- This may be too ambitious if J = dim(%) = Jjy grows with N. Hence, we restrict ourselves 
to the setting where the number of active variables J ac t < oo is fixed (from the active set S), 
i.e. 

h(v\x) = ^2 hj(v\xj), S C {1, . . . , J} with \S\ = J act . 

For the approximation, we typically would need Kq^, — > oo (N — > oo) for suitable 
basis functions and n = njy — > oo (N — > oo); furthermore, the grid v\ < v 2 < . . . < v n 
should become dense as n = n^ — > 00, and also the values Xf ct , . . . , should become 
dense in xs Q X as N — > 00 (here, X act = {Xj; j G S} € Xs)- H J = Jn grows, but the 
number of active variables in the model J ac t < 00 is fixed, then some uniform approximation 
h-y 0N (v\x) — > h(v\x) is possible under regularity conditions. 

We provide a summary for a typical situation. 
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Corollary 1. Consider the setting as in Theorem 1, with J = Jn potentially growing but fixed 
dimensionality of the active variables J act < oo, n = njy — > oo (JV — > oo), and the functions 
are sufficiently regular such that (8) holds. Then, for M = Mn as in Theorem 1: 

N n 

(nN)- 1 X^V*] " h ^\ X i)) 2 = °p0) («> N -> «))• 

8=1 1=1 

This result states that the estimated H^m) are consistent for the true transformation func- 
tion h. 

6. Applications 

In this section, we present analyses with special emphasis on higher moments of the conditional 
distribution, which have received less attention in previous analyses of these problems. We 
show that semiparametric regression using conditional transformation models is a valuable 
tool for detecting interesting patterns beyond the conditional mean. 

"Evolution Canyon" Bacteria 

The Bacillus simplex populations from "Evolution Canyons" I and II in Israel have recently 
developed into a model study of bacterial adaptation and speciation under heterogeneous 
environmental conditions (Sikorski and Nevo 2005). These two canyons represent similar 
ecological sites, 40 km apart, in which the orientation of the sun yields both a strongly sun- 
exposed, hot, 'African', south-facing slope and a cooler, mesic, lush, 'European', north-facing 
slope within a distance of only 50-400 m. Based on DNA sequences, the B. simplex population 
phylogenetically splits into two major groups, GLi and GL2. These main groups are further 
subdivided into phylogenetic groups, the so-called 'putative ecotypes' (PE), which show a 
clear preference for one of the two slope types (Sikorski and Nevo 2005, Sikorski, Pukall, and 
Stackebrandt 2008b). GL2 is composed of only PEi and PE2; GLi consists of PE3-PE9. 

Sikorski et al. (2008b) analysed the physiological properties of the bacteria that might ex- 
plain their characteristic slope-type preferences. For example, the physical integrity of the cell 
membrane at different temperatures is crucial for cell survival, and particularly its fatty acid 
composition is of substantial importance. Sikorski, Brambilla, Kroppenstedt, and Tindall 
(2008a) compared the mean contents of the fatty acids that tolerate high and low tempera- 
tures of the B. simplex ecotypes. The data showed heteroscedastic variances across putative 
ecotypes, and Herberich, Sikorski, and Hothorn (2010) analysed the data using heteroscedas- 
ticity consistent covariance estimation techniques. Here, we aim at estimating the whole 
conditional distribution of the fatty acid contents (FA) of each of the six putative ecotypes 
PE3-PE7 and PE 9 from GLi. 

The simplest conditional transformation model allowing for heteroscedasticity reads 

P(FA < i;|PE = PE fc ) = <5>{a , k + a k v), k G {3, . . . , 7, 9}. (9) 

The base-learner is defined by a linear basis bo(v) = (1, v ) T for the grid variable and a dummy- 
encoding basis 61 (PE) = (/(PE = PE 3 ), . . . , /(PE = PE 9 )) T for the six putative ecotypes 
{3, . . . , 7, 9}. The resulting 12-dimensional parameter vector *y 1 of the tensor product base- 
learner then consists of separate intercept and slope parameters for each of the putative 
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Figure 1: Evolution Canyon Bacteria. Estimated conditional distribution functions (depicted 
as pl(p < 0.5) + (1 — p)I(p > 0.5) for probability p) of the fatty acid content of bacteria from 
six different putative (PE) ecotypes. Model (9) was fitted with ("linear") and without ("linear 
M = oo") early stopping via the bootstrap whereas model (10), here denoted "smooth", was 
stopped early. The observations are given as rugs. 



ecotypes. Since both bases are parametric, we set the penalty matrices Po and P\ to zero 
and choose Ao = Ai = 0, i.e. the base-learner does not penalise the estimates. Note that since 
we assume normality for the linear function ao,fc + a/%FA ~ A/"(0, 1), also the fatty acid content 
FA is assumed to be normal with mean and variance depending on the putative ecotype. Since 
no penalisation is defined for the base- learners, penalisation of the model parameters depends 
on the number of boosting iterations M . For a very large number of iterations, the algorithm 
converges to the estimate that is obtained by maximising the likelihood of the linear probit 
model fitted to the binary responses FAj < v % , and consequently we can fit this model by 
probit regression on the expanded observations in this simple setup. 

We can relax the normal assumption on FA by allowing for more flexible transformations in 
the model 

P(FA < v\PE = PE fc ) = $(/i(<u|PE = PE*)). (10) 

Now bo(v) is a vector of i3-spline basis functions evaluated at v for some reasonable choice 
of knots, while foi remains as above. Hence, instead of assuming separate linear effects for 
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the putative ecotypes, we now assume separate non-parametric effects parameterised in terms 
of S-splines. To achieve smoothness of these non-parametric effects along the u-grid, we 
specify the penalty matrix Pq as Po = D T D with second-order difference matrix D. We 
do not penalise differences between the estimated functions of different putative ecotypes, 
i.e. P\ = 0, and therefore the penalty matrix of the tensor product is simply given by 
Pox = A diag(P ,...,Po)- Although we still assume /i(FA|PE = PE fc ) ~ Af(0, 1), the 
function h may now be non-linear, and thus the conditional distribution of FA given PE may 
be any distribution that can be generated by a monotone transformation of a standard normal. 
In this sense, the model (10) is non-parametric. The validity of the normal assumption on 
FA is plausible when the estimated functions /i(t>|PE = PE&) are essentially linear in v. An 
alternative non-parametric estimate can be obtained by kernel smoothing for mixed data types 
(Li and Racine 2008, Hayfield and Racine 2008), and we compare the two fits in Figure 1. 

We fitted model (9) without (M = oo) and with early stopping (via 25-fold bootstrap strati- 
fied by PE) and model (10) with early stopping to the data and in addition report the result 
of kernel smoothing, whose bandwidth was determined via cross-validation. The conditional 
distribution functions of fatty acid content given the putative ecotype are depicted in Figure 1. 
For probabilities larger than 0.5, the estimated conditional distribution functions were mir- 
rored at the 0.5 horizontal line to allow for easier graphical inspection of medians, variances 
and potential skewness. 

With respect to the conditional median, the four models lead to very similar results and 
support the conclusion from earlier investigations (Sikorski et al. 2008a, Herberich et al. 
2010) that the fatty acid content of B. simplex from putative ecotype PE5 is smaller than 
the others and that from PE9 is larger than the others, with the remaining ones showing 
no differences. The variability cannot be assumed to be constant across different putative 
ecotypes, but there is no indication of asymmetry. Early stopping and bandwidth choice via 
resampling methods lead to almost the same estimated distribution functions. Minimising 
the empirical risk function without early stopping ("linear M = 00"; this is equivalent to 
linear probit regression on expanded observations) leads to slightly smaller variances in PE5 
and PEg. The difficulty in discriminating between the linear model (9) and the more flexible 
model (10) and the lack of asymmetry in the plots indicate that a normal assumption on fatty 
acid content is justifiable. 

Childhood Nutrition in India 

Childhood undernutrition is one of the most urgent problems in developing and transition 
countries. To provide information not only on the nutritional status but also on health and 
population trends in general, Demographic and Health Surveys (DHS) conduct nationally 
representative surveys on fertility, family planning, maternal and child health, as well as child 
survival, HIV/AIDS, malaria, and nutrition. The resulting data - from more than 200 surveys 
in 75 countries so far - are available for research purposes at www . measuredhs . com. 

Childhood nutrition is usually measured in terms of a Z score that compares the nutritional 
status of children in the population of interest with the nutritional status in a reference pop- 
ulation. The nutritional status is expressed by anthropometric characteristics, i.e. height for 
age; in cases of chronic childhood undernutrition, the reduced growth rate in human develop- 
ment is termed stunted growth or stunting. The Z score, which compares an anthropometric 
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Figure 2: Childhood Nutrition in India. Colour-coded map of the 10% and 90% conditional 
quantiles of the Z score. Each dot in the colour legend corresponds to one district with the 
respective colour in the map. Blue values in the northern part of India correspond to small 
lower and upper quantiles. Red values, especially in the eastern Meghalaya and Assam states, 
indicate small lower quantiles but at the same time large upper quantiles. In the southern 
part of India, the lower quantiles are largest with moderate upper quantiles. White parts 
indicate districts with no observations. 
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characteristic of child i to values from a reference population, is given as 

_ ACi - m 

Z% — 

s 

where AC denotes the anthropometric characteristic of interest and m and s correspond to 
median and (a robust estimate for the) standard deviation in the reference population (strati- 
fied with respect to age, gender, and some other variables). While weight might be considered 
as the most obvious indicator for undernutrition, we will focus on stunting, i.e. insufficient 
AC = height for age, in the following. Stunting provides a measure of chronic undernutrition, 
whereas insufficient weight for age might result from either acute or chronic undernutrition. 
Note that the Z score, despite its name, is not assumed to be normal. 

Here we focus on estimating the whole distribution of the Z score measure for childhood nutri- 
tion in India, one of the fastest growing economies and the second-most populated country in 
the world. Our investigation is based on India's 1998-1999 Demographic and Health Survey 
(DHS, NFHS 2000) on 24, 166 children visited during the survey in 412 of the 640 districts of 
India. The lower quantiles of this distribution can be used to assess the severity of childhood 
undernutrition, whereas the upper quantiles give us information about the nutritional status 
of children in families with above-average nutritional status. 

The simplest conditional transformation model allowing for district-specific means and vari- 
ances reads 

F(Z < v (district = k) = $(a ,fc + oi k v), k = 1, . . . , 412. 

The base-learner is defined by a linear basis bo(v) = (1, t>) T for the grid variable and a dummy- 
encoding basis b\ (district) = (/(district = 1), . . . , /(district = k)) T for the 412 districts. The 
resulting 824-dimensional parameter vector -ji of the tensor product base-learner then consists 
of separate intercept and slope parameters for each of the districts of India. Note that since 
we assume normality for the linear function aQ^ + a^Z ~ A/"(0, 1), also the Z score is assumed 
to be normal with mean and variance depending on the district. 

We can relax the normal assumption on Z by allowing for more flexible transformations in 
the model 

F(Z < u|district = k) = $(ft(v|district = k)), k = 1, . . . , 412. (11) 

Now bo(v) is a vector of S-spline basis functions evaluated at v for some reasonable choice 
of knots, while &i remains as above. Hence, instead of assuming separate linear effects for 
the districts, we now assume separate non-parametric effects parameterised in terms of B- 
splines. To achieve smoothness of these non-parametric effects along the u-grid, we specify 
the penalty matrix Pq as Pq = D D with second-order difference matrix D. It makes 
sense to induce spatial smoothness on the conditional distribution functions of neighbouring 
districts since we do not expect the distribution of the Z score to change much from one 
district to its neighbouring districts. In fact, spatial smoothing is absolutely necessary in 
this example since otherwise we would estimate 412 separate distribution functions for the 
districts in India. To implement spatial smoothness of neighbouring districts, the penalty 
matrix P\ is chosen as an adjacency matrix, where the off-diagonal elements indicate whether 
two districts are neighbours (represented with a value of —1) or not (represented with a value 
of 0). The diagonal of the adjacency matrix contains the number of neighbours for the 
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corresponding district. The estimated conditional transformation function /i(Z|district = k) 
can be interpreted as a transformation of the Z scores in district k to standard normality. 
Because the number of observations is large and the base-learner is fitted with penalisation, 
we stopped the boosting algorithm according to the in-sample empirical risk. 

From the estimated conditional distribution functions, we compute the r quantiles of the Z 
score for each district via 

Q(r|district = k) = ini{v : §{h{v\ district = k) > r}. 

The conditional 10% and 90% quantiles are depicted in a colour-coded map in Figure 2. 
The spatially smooth estimated lower and upper conditional quantiles shown simultaneously 
allow differentiation between three groups of districts: (A) districts with small lower and 
upper conditional quantiles (blue, especially in the Uttar Pradesh state), where the Z score 
is stochastically smaller than that of the remaining parts of India and thus all children are 
less well fed; (B) districts with more severe inequality, i.e. small lower but at the same 
time large upper quantiles (red, in the Meghalaya and Assam states); and (C) districts with 
relatively large lower and upper quantiles, which indicates a relatively good nutrition status 
of all children in the southern districts of India (violet, in Andhra Pradesh, Madhya Pradesh, 
Maharashtra, Tamil Nadu, and Kerala). 

Head Circumference Growth 

The Fourth Dutch Growth Study (Fredriks, van Buuren, Burgmeijer, Meulmeester, Beuker, 
Brugman, Roede, Verloove-Vanhorick, and Wit 2000) is a cross-sectional study that measures 
growth and development of the Dutch population between the ages of and 22 years. The 
study measured, among other variables, head circumference (HC) and age of 7482 males 
and 7018 females. Stasinopoulos and Rigby (2007) analysed the head circumference of 7040 
males with explanatory variable age using a GAMLSS model with a Box-Cox t distribution 
describing the first four moments of head circumference conditionally on age. The models show 
evidence of kurtosis, especially for older boys. We estimate the whole conditional distribution 
function via the conditional transformation model 

P(HC < i;|age = x) = &(h(v\age = x)). 

The base- learner is the tensor product of S-spline basis functions bo(v) for head circumference 
and -B-spline basis functions for age 1 / 3 . The root transformation just helps to cover the 
data better with equidistant knots. The penalty matrices Pq and Pi penalise second-order 
differences, and thus h will be a smooth bivariate tensor product spline of head circumference 
and age. It is important to note that smoothing takes place in both dimensions. Consequently, 
the conditional distribution functions will change only slowly with age, which is a reasonable 
assumption. Since the number of observations is also large, we stopped the algorithm based 
on the in-sample empirical risk. 

Figure 3 shows the data overlaid with quantile curves obtained via inversion of the estimated 
conditional distributions. The figure can be directly compared with Figure 16 of Stasinopoulos 
and Rigby (2007) and also indicates a certain asymmetry towards older boys. 

Deer-vehicle Collisions 

Collisions of vehicles with roe deer are a serious threat to human health and animal welfare. 
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Figure 3: Head Circumference Growth. Observed head circumference and age for 7040 boys 
with estimated quantile curves for r = 0.04, 0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98, 0.996. 



In Bavaria, Germany, more than 40, 000 deer-vehicle collisions (DVCs) take place every year. 
Hothorn, Brandl, and Miiller (2012) investigated the spatial distribution of the risk of deer- 
vehicle collisions; here we focus on the temporal aspect of the risk for two years, 2006 and 
2009. For all 74, 650 collisions reported to the police in these two years, we attributed each 
accident to the specific day of the year. 

Although the number of DVCs is a discrete random variable, the distribution of the number 
of DVCs conditional on the day of the year can be estimated by means of an appropriate 
base-learner using the model 

P(DVCs < u|day = :ri,year = x?) = $(/ii(u|day = x{) + /i2(u|day = xi,year = X2)). 

Here, ft is the counting measure with support vi,..., vn equal to the support of the empirical 
distribution of the response. Conceptually, the basis function bo should allow for n = N 
parameters (one for each v t ), whose first-order differences should not become too large. To 
restrict the number of parameters in the base- learners, we use 5-splines to approximate such 
a discrete function on the f-grid. It should further be noted that the day of year is a discrete 
cyclic random variable. Therefore, we chose b\(xi) as cyclic l?-splines of the day, which are 
obtained by a simple modification of the S-spline design matrix and the difference penalty 
that results from fusing the two ends of the co-domain. In analogy, a cyclic S-spline is applied 
to the varying coefficient term 62(^17^2) = &i(^i) x I{x2 = 2009), which captures temporal 
differences between the two years and yields a cyclic i?-spline of the days in 2009. Since the 
data are discrete, we only penalise first-order differences in both base- learners. 
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Figure 4: Deer-vehicle Collisions. Number of deer-vehicle collisions (DVCs) per day 
in 2006 and 2009 in Bavaria, Germany, with estimated quantile curves for r = 
0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95. 



Figure 4 shows three risk peaks. The first one occurs early in May - the beginning of the 
growing and buck hunting season - and ends mid-June. A second and sharper peak is observed 
in the first week of August and corresponds to the mating season of roe deer. After a low- 
risk period of approximately six weeks, the risk starts to increase again at the beginning of 
October and slowly decreases until April for reasons yet unknown. Note that the distribution 
in 2009 has a larger median than that in 2006 but also shows less extreme peaks. 



Birth Weight Prediction 

Recent advances in neonatal medicine have lowered the threshold of survival to a gestational 
age of 23-24 weeks and to a birth weight of approximately 500 g. As neonatal risks of 
morbidity and mortality are highest in the lowest weight range, diagnostic assessment of the 
small foetus needs to be as precise as possible. Schild, Maringa, Siemer, Meurer, Hart, Goecke, 
Schmid, Hothorn, and Hansmann (2008) focused on this high-risk group of small foetuses 
(< 1600 g) and proposed a formula for estimating birth weight based on ultrasound imaging 
performed within seven days before delivery. In addition to predicting the expected birth 
weight given four standard 2D ultrasound parameters (HC: head circumference, FE: femur 
length, BPD: biparietal diameter, and AC: transverse diameter and circumference of the foetal 
abdomen) and three additional 3D ultrasound parameters (UA: upper arm volume, FEM: 
thigh volume, and ABDO: abdominal volume) X £ I 7 , we aim at assessing the uncertainty of 
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this prediction. The data on 150 predominantly Caucasian women, collected in a prospective 
cohort study at the universities in Bonn and Erlangen, Germany, analysed by Schild et al. 
(2008), were utilised to derive 80% prediction intervals for birth weight (BW). 

We begin with the linear model estimated by Schild et al. (2008) 

BW X = 656.41 + 1.832 xABDO + 31.198 xHC + 5.779 xFEM + 

73.521 x FL + 8.301 x AC - 449.886 x BPD + 32.534 x BPD 2 + 
77.465 x $ _1 (*7), 

and the classical prediction interval for a foetus with ultrasound parameters x is then the 
symmetric interval around the estimated conditional mean E(BW|X = x), whose width is 

given by 2 x ii 5 Q_ 8 ,o.9 X 77.465 x ^1 + Var(E(BW|X = x)). 

The normal assumption can be relaxed by deriving the upper and lower conditional quantiles 
from two quantile regression models. Linear quantile regression (Koenker and Bassett 1978) 
for the conditional 10%, 50% and 90% quantiles assumes that 

BW X = a ,r + x T a T + Q T (U), for r = 0.1, r = 0.5, and r = 0.9 

with Q t (t) = 0. The corresponding prediction interval for a foetus with ultrasound parame- 
ters x is now (&o,o.i + xT &o.i, o-ofl.9 + s^&o.d)- A more flexible description of the functional 
relationship between ultrasound parameters and quantiles is given by the additive quantile 
regression model (Koenker, Ng, and Portnoy 1994) 

7 

BW, = a , T + r iA x s) + Q T (U), for t = 0.1, r = 0.5, and r = 0.9. 
j=l 

Here, Tj T is a quantile-specific smooth function of the jth ultrasound parameter. Parameter 
tuning is difficult for these models; we therefore applied a boosting approach to additive quan- 
tile regression (Fenske, Kneib, and Hothorn 2011, with early stopping via 25-fold bootstrap). 
Prediction intervals can now be derived by («o,o.i + Ylj=i ^jfl-i( x j)i "0,0.9 + Ylj=i ?j,0.9(xj)). 
Note that for either quantile regression model, the prediction interval is based on two separate 
models: one for r = 0.1 and one for r = 0.9. 

Finally, we derive prediction intervals from the conditional transformation model 

P(BW < v\X = a?) = $ h j( v \ x j)j 

where, under the assumption of additivity of the transformation function h, each ultrasound 
parameter may influence the moments of the conditional birth weight distribution. The jth. 
base-learner is the tensor product of l?-spline basis functions bo(v) for birth weight and bj(xj) 
are £>-spline basis functions for the jth. ultrasound parameter. The penalty matrices Po and 
Pj penalise second-order differences, and thus all estimates hj will be smooth bivariate tensor 
product splines of birth weight and the respective ultrasound parameter, with both dimensions 
being subject to smoothing. The number of boosting iterations was determined by 25-fold 
bootstrap. From the estimated conditional distribution functions, we compute the r quantiles 
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of the birth weight via 

Q(t\X = x) = inf ju : $ |ay) J > r J 

and derive the prediction interval as (<5(0.1| X = a;), Q(0.9|X = a;)). Note that, unlike in the 
quantile regression approach, the prediction interval obtained from the conditional transfor- 
mation model is based on only one model that describes the whole conditional distribution of 
birth weight. 

The observed birth weights ordered with respect to the predicted mean (linear model) or 
median (quantile regression and conditional transformation model) are depicted in Figure 5. 
In addition, the respective 80% prediction intervals are visualised by grey areas. It must 
be noted that, for all models, the prediction intervals are only interpretable for future ob- 
servations; however, poor coverage for the learning sample also indicates poor coverage for 
future cases. The prediction intervals obtained from linear quantile regression indicate that 
the model is confident about its predictions over the whole range of birth weights. This is also 
the case for the additive quantile regression models for birth weights of approximately 1000 
g, but the uncertainty increases for very small and larger foetuses. The intervals obtained 
from the linear model and the conditional transformation model appear to be similar. For 
birth weights between 500 and 1400 g, the prediction intervals of the conditional transforma- 
tion model are symmetric around the median. This might be an indication that the normal 
assumption by the linear model is not completely unrealistic. The smaller interval widths 
that can be seen for the linear model are most likely due to the variance estimate in this case 
ignoring the model choice process that was performed prior to the final model fit by Schild 
et al. (2008). The conditional transformation model takes this variability into account. The 
results may also be an indication that the assumption of additivity of the transformation 
function rather than of the regression function (for quantile regression models) might be more 
appropriate for modelling birth weights. 



Beyond Mean Boston Housing Values 

The Boston Housing data, first published by Harrison and Rubinfeld (1978) and later cor- 
rected and spatially aligned by Gilley and Pace (1996), have become a standard test-bed for 
variable selection and model choice. Almost exclusively, the 13 explanatory variables have 
been selected with respect to their influence on the mean or median of the conditional median 
house value in a certain tract. Assuming a conditional transformation model, we attempt to 
detect dependencies of higher moments of the conditional median house value from the ex- 
planatory variables. We focus on the 12 numeric explanatory variables and ignore the binary 
variable coding for Charles River boundary in the conditional transformation model 

(12 12 
M x j) + X] hj(v\xj) 

In this model, extract is a tract-specific, spatial random effect, whose correlation structure is 
determined by a Markov random field defined by the neighbouring structure of the tracts cap- 
turing spatial autocorrelation and heterogeneity (similar as in the example on childhood nutri- 
tion in India). The term ho(v\l) is an unconditional transformation of the median house value, 
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Figure 5: Birth Weight Prediction. Observed birth weights for 150 small foetuses (dots), 
ordered with respect to the estimated mean or median expected birth weight (central black 
line). The shaded area represents foetus specific 80% prediction intervals for the linear model 
(LM), linear quantile regression model (LQR), additive quantile regression model (AQR) and 
conditional transformation model (CTM). 
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i.e. this transformation is independent of the explanatory variables. The explanatory vari- 
ables may influence the mean of the transformed median house value /io(MEDV|l) via h x (x) = 
Y2j=i hj(M x j) om y or may also affect higher moments via the interaction terms X]j=i hj( v \ x j)- 
The latter term extends the transformation model foo(MEDV|l) + Ylj=i hj(M x j) to a condi- 
tional transformation model. The base- learners for the transformation function ho(v\l), the 
effects hj(l\xj) and the interaction terms hj(v\xj) are constructed based on cubic l?-spline 
basis functions supplemented with second-order difference penalty. More specifically, bj (x) 
and bo(v) are both represented in terms of a reparameterisation of the S-spline basis func- 
tions that allows separation of the non- linear terms into a constant, a linear effect and the 
non-linear (orthogonal) deviation from the linear effect, i.e. 

bj(x) = 1 + Xj + bj(xj) and bo(v) = 1 + v + t>o(v), 

where bj(xj) and bo(v) are the non-linear deviation effects. Taking the tensor product after 
applying the decomposition yields a decomposition into linear and non-linear main effects 
of Xj and v as well as linear and non-linear interaction terms (see Fahrmeir, Kneib, and 
Lang 2004, Kneib, Hothorn, and Tutz 2009, for technical details of this decomposition). The 
advantage of this expanded parameterisation is that the automatic model choice capabilities 
of the boosting algorithm allow us to flexibly determine whether linear or non-linear effects 
are required and whether there actually is an interaction between the transformation function 
and specific effects of explanatory variables. 

Censored observations were dealt with by choosing inverse probability of censoring weights 
Wi for the empirical risk function (4) derived from the Kaplan-Meier estimate of the censoring 
distribution. The stability selection procedure (Meinshausen and Buhlmann 2010) selected 
three variables that have an influence on the conditional distribution of the median housing 
value (MEDV), namely per capita crime (CRIM), average numbers of rooms per dwelling 
(RM), and percentage values of lower status population (LSTAT). After variable selection, 
we refitted a conditional transformation model of the simpler form 

P(MEDV < f | CRIM, RM, LSTAT) 

= $ (/icrimWCRLM) + /irmMRM) + /^statH LSTAT)) , 

where the base-learners are tensor products of S-spline bases. The fitted functions can be 
conveniently depicted in the observation space. For example, a scatter plot of MEDV and 
CRIM and a grey- level image of the bivariate function /icMM(MEDV|CRIM) can be viewed 
in the same coordinate system. Similar to the mirrored distribution functions in Figure 1, we 
show negative absolute values of the fitted functions h for easier interpretation. 

Figure 6 indicates that the percentage values of lower status population (LSTAT) lead to 
smaller values of the median housing value at almost constant variance. However, the condi- 
tional distribution will be skewed towards higher MEDV values. For tracts with small average 
numbers of rooms per dwelling (RM), the median housing value is small and increases with 
increasing numbers of rooms. The same applies to the variability, since the estimated function 
/irm(MEDV|RM) shows more spread for larger values of RM. Per capita crime seems to have 
an effect on variability and skewness, since for larger crime values, the distribution will be 
heavily skewed and less variable than small per capita crime values. However, compared to the 
other two variables, the influence is only of marginal value due to small absolute contributions 
of this model term to the full model. 
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Figure 6: Beyond Mean Boston Housing Values. Conditional transformation model for the 
three selected variables per capita crime (CRIM), average numbers of rooms per dwelling 
(RM) and percentage values of lower status population (LSTAT). Each panel depicts the 
data as scatter plots along with the corresponding negative absolute values of the estimated 
transformation function at the probit scale. The explanatory variables were standardised 
prior to analysis. 
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7. Empirical Evaluation 

We shall compare the empirical performance of conditional transformation models fitted by 
means of the proposed boosting algorithm to two competitors. Conditional transformation 
models are semiparametric models in the sense that we assume a certain distribution for the 
transformed responses and additivity of the model terms on the scale of the corresponding 
quantile function. Therefore, it is natural to compare the goodness of the estimated condi- 
tional distribution functions to a fully parametric approach and a non-parametric estimation 
technique. 

For the sake of simplicity, we study a model in which two explanatory variables influence both 
the conditional expectation and the conditional variance of a normal distributed response Y. 
The error term <J> _1 (J7) is standard normal, and, to obtain normal responses, we restrict the 
possible transformations to linear functions: 

$-\U) = h(Y x \x) = Y J h J (Y x \x) = Y J b j (x)Y x -a J (x) 

j j 

Although the partial transformation functions are linear in Y x , the expectation and variance 
depend on the explanatory variables in a non-linear way. The choices X\ ~ U[0, 1],X 2 ~ 
U[— 2,2], a\(x) = 0,02(2:) = x 2 , and b\(x) = x±,b 2 (x) = 0.5 lead to the heteroscedastic 
varying coefficient model 

y * = i7Ta5- + S7Ta^ 1(t "- (12 > 

where the variance of Y x ranges between 0.44 and 4 depending on X±. This model can be 
fitted in the GAMLSS framework under the assumptions that the expectation of the normal 
response depends on a smoothly varying regression coefficient (X\ + 0.5) -1 for X2 and that 
the variance is a smooth function of X\. This model is therefore fully parametric. As a non- 
parametric counterpart, we use a kernel estimator for estimating the conditional distribution 
function of Y x as a function of the two explanatory variables. 

The conditional transformation model 




F(Y < v\Xi = Xl ,X 2 = x 2 ) = <f>(h(v\x 1 ,x 2 )) = + h 2 {v\x 2 )) 

is a semiparametric compromise between these two extremes. The error distribution is as- 
sumed to be standard normal and additivity of the transformation function h is also part 
of the model specification. The base-learners are tensor products of S-spline basis functions 
bo (v) for Y and S-spline basis functions for X\ and X 2 , respectively. The penalty matrices 
Po, Pi and P 2 penalise second-order differences, and thus hj will be smooth bivariate tensor 
product splines of the response and explanatory variables X\ and X 2 . Smoothing takes place 
in both dimensions. 
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For all three approaches, we obtain estimates of ¥(Y < v\Xi = x\,X 2 = x 2 ) over a grid on 
x\, X2 and compute the mean absolute deviation (MAD) of the true and estimated probabilities 

MAD(xi,s 2 ) = 1 V \P(X < v\X x = x u X 2 = x 2 )-P(Y < v\X x = x x ,X 2 = x 2 )\ 
n — ' 

V 

for each pair of x\ and x 2 . Then, the minimum, the median, and the maximum of the MAD 
values over this grid are computed as summary statistics. This procedure was repeated for 
100 random samples of size N = 200 drawn from model (12). Cross-validation was used to 
determine the band widths for the kernel-based methods, for details see Hayfield and Racine 
(2008). The boosting-based estimation of GAMLSS models (Mayr, Fenske, Hofner, Kneib, 
and Schmid 2012a) turned out to be more stable than the reference implementation (package 
gamlss, Stasinopoulos, Rigby, and Akantziliotou 2011), and we therefore fitted the GAMLSS 
models by the dedicated boosting algorithm. For GAMLSS and conditional transformation 
models fitted by boosting, the number of boosting iterations was determined via sample- 
splitting. To investigate the stability of the three procedures under non-informative explana- 
tory variables, we added p = 1, . . . , 5 uniformly distributed variables without association to 
the response to the data and included them as potential explanatory variables in the three 
models. The case p = corresponds to model (12). 

Figure 7 shows the empirical distributions of the minimum, median and maximum MAD 
for the three competitors. For p = 0, GAMLSS and conditional transformation models 
perform on par with respect to the median MAD, although GAMLSS shows a somewhat 
larger variability. The median MAD is slightly smaller than 0.02 for both procedures, which 
indicates that the true conditional distribution function can be fitted precisely. The maximal 
MAD is smallest for conditional transformation models and can be quite large for GAMLSS. 
In contrast, for some configurations of the explanatory variables, GAMLSS seems to offer 
better estimates with respect to the minimal MAD. The kernel estimator leads to the largest 
median MAD values but seems more robust than GAMLSS with respect to the maximal MAD. 
These results are remarkably robust in the presence of up to five non-informative explanatory 
variables, although of course the MAD increases with p. 

The general theme that GAMLSS on average performs as well as conditional transformation 
models in the special case of model (12) but is associated with a larger variability might be 
explained by the independent estimation of the functions for the expectation and variance, 
i.e. GAMLSS does not "know" that the varying coefficient term and the variance term are 
actually the same. The inferior performance of the kernel estimator might be explained by the 
technical difficulties associated with bandwidth choice. The tuning parameters for the two 
boosting approaches are easier to choose. Our general impression is that the kernel-estimated 
conditional distribution functions are more erratic than the smooth functions obtained with 
boosting for conditional transformation models (analysis of simulation data not shown here). 

Since conditional transformation models are also an alternative to quantile regression models, 
it would be interesting to compare both approaches. At this point, it is important to recall 
that the two models assume additivity of the effects of X\ and X 2 , however, on different scales 
as explained in Section 2. Consequently, the heteroscedastic varying coefficient model (12) 
cannot be fitted in a straightforward way using standard linear or additive quantile regression. 
Though, the estimation problem can be slightly reformulated by describing the r-quantile of 
Y x as the sum of a varying coefficient term r\{x{)x 2 and a smooth function r 2 {x\). This 
model, implemented using the boosting approach to additive quantile regression with varying 
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Figure 8: Comparison of conditional transformation models and additive quantile regression. 
Scatterplot of true versus estimated quantiles obtained from one conditional transformation 
model (CTM) and from three additive quantile regression (AQR) models fitted to 200 obser- 
vations drawn from the heteroscedastic varying coefficient model (12). 



coefficients introduced by Fenske et al. (2011), allows the estimation of conditional T-quantiles. 
We fitted three such quantile regression models (for r = 0.5,0.75,0.9) to a sample of size 
N = 200 from model (12) and determined the optimal number of boosting iterations by the 
out-of-bootstrap empirical risk of the check function. To give an impression, we compare these 
estimated r-quantiles with the corresponding conditional quantiles obtained by inverting the 
estimated conditional distribution function from a conditional transformation model. Figure 8 
displays scatterplots of the true conditional quantiles over a grid of X\ and X2 values with the 
corresponding estimated quantiles derived from one conditional transformation model and 
the three additive quantile regression models for r = 0.5, 0.75, 0.9, the latter models including 
the varying coefficient term. It seems that, in this example, both approaches recover the true 
quantiles equally good. 



8. Discussion 

In his book Quantile Regression, Koenker (2005) puts transformation models in the "twilight 
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zone of quantile regression" and suggests that estimating conditional distribution functions by 
means of transformation models might be an alternative to the direct estimation of conditional 
quantile functions. We undertook the "worthwhile exercise" (Koenker 2005, Section 8.1.1.) 
and developed a semiparametric framework for the estimation of conditional distribution 
functions by conditional transformation models that allows higher moments of the conditional 
distribution to depend on the explanatory variables. 

Because the empirical risk function (4) is equivalent to well-established risk functions for bi- 
nary data, there are many potentially interesting algorithms that can be used to fit conditional 
transformation models, although dependent observations have to be dealt with. We chose a 
component-wise boosting approach mainly because of its divide-and-conquer strategy, which 
allows a very efficient fitting of base-learners that depend on the response and on one or more 
explanatory variables at the same time via linear array models. In addition, the algorithm 
is general, and different base-learners appropriate for the model at hand can be easily imple- 
mented as illustrated in Section 6. Also very attractive are the model choice properties, as 
for example illustrated by means of the Boston Housing data in Section 6. The theoretical 
and empirical properties investigated in Sections 5 and 7 indicated that the method is ap- 
plicable to more high-dimensional situations as well. While boosting became popular owing 
to its success in fitting simple models under challenging circumstances - especially linear or 
additive models for high-dimensional explanatory variables - the attractiveness of this class of 
algorithms for fitting challenging models in simple circumstances has been only rarely recog- 
nised. Exceptions are Ridgeway (2002) and Sexton and Laake (2012), who study boosting 
algorithms for fitting density functions. Lu and Li (2008), Schmid and Hothorn (2008) and 
Schmid, Hothorn, Maloney, Weller, and Potapov (2011) proposed boosting algorithms for 
transformation models that treat the transformation function hy as a nuisance parameter. 
In the same model framework, Tutz and Groll (2013) propose a likelihood-boosting approach 
for fitting cumulative and sequential models for ordinal responses. 

Boosting algorithms for estimating conditional quantiles by minimising the check function 
have been introduced by Kriegler and Berk (2010), Fenske et al. (2011), and Zheng (2012). 
The computation of prediction intervals based on pairs of such models is rather straightforward 
(Mayr, Hothorn, and Fenske 2012b). Our approach to the estimation of the conditional 
distribution function has the advantage that one model fits the whole distribution, which 
can then be used to derive arbitrary functionals from. The quantile score representation of 
the continuous ranked probability score (see Gneiting and Ranjan 2011) might be a basis to 
develop a boosting technique similar to the one described in this paper for the estimation of 
full conditional quantile functions. The main difference between transformation and quantile 
regression models that we have to keep in mind is that additivity is assumed on two different 
scales. From a practical point of view, diagnostic tools to assess on which of these scales it is 
more appropriate to assume an additive model would be very important. 

The applications presented in Section 6 showed that conditional transformation models are 
generic, and one can, by choosing appropriate base- learners, fit models that are specific to the 
problem at hand. An empirical evaluation showed that the estimated conditional distribu- 
tion functions are on average as good as the estimates obtained from a parametric approach 
(GAMLSS) that relies on more assumptions. In comparison to non-parametric kernel dis- 
tribution estimators, conditional transformation models are more adaptable, for example to 
spatial or temporal data. The performance of the semiparametric models compared to that of 
the non-parametric competitor was considerably better at the small price of the assumption 
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of additivity of the transformation function. 

It will be interesting to further study conditional transformation models with respect to the 
following extensions. Instead of making assumptions about the quantile function Q represent- 
ing the error distribution, it would be possible to fit the corresponding distribution function 
by techniques introduced for single index models (Tutz and Petry 2012). To reduce the com- 
plexity, one could restrict the partial transformation functions hj to linear functions, i. e. only 
the first two moments are allowed to be influenced by the explanatory variables. Furthermore, 
when the response Y = {Y\,Y 2 ) is bivariate, the bivariate conditional distribution function 
P(Yi < t>i,Y 2 < v 2 \X = x) can be estimated by minimising the risk function 

p{((yi < v\ Ay 2 < v 2 ), x), h(vi,v 2 \x)} dfi(vi) dfi(v 2 ) dF Y ,x(yi,y2, x) 

in h. It might also be interesting to allow only certain moments to depend on certain ex- 
planatory variables. For example, a partially conditional transformation model of the form 

Ji J 
h(v\x) = hj(v\x) + hj(l\x) 
3=1 j=-h+i 

describes the conditional expectation ^/=Ji+i hj(M x ) °f a transformation YljLx hj(Y x \x). 
Researchers pay most attention to the conditional distribution function in applications where 
the response is a survival time or has been censored owing to other reasons. In survival anal- 
ysis, it is common to deal with the conditional survivor function S(Y x \x) = 1 — F(Y x \x) for 
survival time Y . In the seemingly simpler situation of an uncensored continuous response, 
many data analysts focus on the conditional mean and do not bother with the conditional 
distribution function at all. Most of the literature on transformation models therefore deals 
with the censored case. The easiest way to deal with right-censoring is to minimise the 
risk function weighted by inverse probability of censoring weights, as in the generalised es- 
timating equations approach by Cheng et al. (1995), where parameters of a linear trans- 
formation model are estimated by the root of a V-statistic defined by all binary indicators 
I(Yi > Yt),i,i = 1, . . . , N. Accelerated failure time models fitted by boosting of an inverse 
probability of censoring weighted risk function have been described by Hothorn, Biihlmann, 
Dudoit, Molinaro, and van der Laan (2006), and future research awaits the investigation of 
the performance of conditional transformation models under censoring. 

Computational Details 

Conditional transformation models were fitted using an implementation of component-wise 
boosting in package mboost (version 2.1-2, Hothorn, Biihlmann, Kneib, Schmid, and Hofner 
2011). Package gamboostLSS (version 1.0-3, Hofner, Mayr, Fenske, and Schmid 2011b) was 
used to fit GAMLSS models and kernel distribution estimation was performed using package 
np (version 0.40-13, Hayfield and Racine 2011). Linear quantile regression was computed using 
package quantreg (version 4.79, Koenker 2011). All computations were performed using R 
version 2.13.2 (R Development Core Team 2011). 

Throughout Section 6, we used the loss function defined by pb m and modelled non-linear 
functions by cubic S-spline bases with 20 equi-distant knots. For further computational 
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details we refer the reader to the R code that implements the analyses presented in Sections 6 
and 7, which is available in an experimental R package ctm at http : //R-f orge . R-project . 
org/projects/ctm. The results presented in this paper can be reproduced using this package, 
except for the birth weight data, which are not publically available. 
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Appendix 



Proofs 



Proof. Proof of Lemma 1 

Convexity: If the loss function p is convex in its second argument, so is the loss function i 

l((Y, X), ah + (l- a)g) < a£((Y, X),h) + (1- a)£((Y, X),g), 

with g(-\x) : M — > R being a monotone increasing transformation function and a £ [0,1], 
because of the convexity of p and the monotonicity and linearity of the Lebesgue integral. 

Population Minimisers: Let / denote the density of F. With iterated expectation we have 
E YtX £((Y,X),h) = j j J P((V <v,x),h(v\x))dp(v)dP Ylx=x (y)dP x (x) 
= j j Jp({y<v,x),h(v\x))dF Ylx=x (y)d f i(v)dF x (x) 

V v ' 

= :A v ^(h(v\x)) 

and the risk is minimal when A V:X (h(v \x)) is minimal for the scalar h(v\x) for all v and x, 
i.e. when 

j_ QA VX (h(v\x)) 
dh{v\x) 

dh{v\x) P ^ V ~ v ' x ^ h ( y \ x ^ d¥ Y\x= x {y) 
P= ^ a f (I(y <v)- F(h(v\x))f(h(v\x)) dP Y{x=x (y) 

= f(h(v\x)) {Jl(v< v) dF Ylx=x (y) - F(h(v\x)) 

f(h(v\x)) {F(Y < v\X = x)- F(h(v\x))} 

which for f(h(v\x)) > is zero for h(v\x) = F^ 1 (P(Y < v\X = x)). Similar, for p = pbi: 
the term 

j_ dA v , x (h(v\x)) 
dh{v\x) 

= j-{Mm f{mx)) - ^w)) f{mx)) ) d¥ ^ x - x{v) 

Jl-I(y<v) dP Ylx=x (y) J I(y < v) dF Y \ x=x (y) 

1-F(h(v\x)) F(h(v\x)) 
1 - P(Y < v\X = x) F(Y<v\X = x) 



f(h(v\x)) 

f(h(v\x))^ x _ F{h{v \ x)) F{h{v\x)) 

is zero for h(v\x) = F^ 1 (P(Y < v\X = x)) when f(h(v\x)) > 0. 
For the absolute error, note that 

p,bc((Y < v, X),h(v\X)) = I(Y < v){l - F{h{v\X))} + {1 - I(Y < v)}F(h(v\X)) 
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and thus 

A V)X (h(v\x)) = J Pabe((y < v,x),h(v\x))dP Y \x= x {y) 

= J I(Y < v){l - F(h(v\X))} + {1 - I(Y < v)}F(h(v\X)) dF Y \ x=x (y) 
= {1 - F(h(v\X))}¥(Y < v\X = x) +F(h(v\X)){l - F(Y < v\X = x)} 

This expression attains its minimal value of ¥(Y < v\X = x) for P(Y < v\ X = x) < 0.5 
when F(h(v\X)) = 0. For P(Y < v\X = x) > 0.5, the minimum 1 - F(Y < v\X = x) is 
attained when F{h{v\X)) = 1. Thus, absolute error will lead to too extreme estimated values 
of h and corresponding conditional distribution functions. □ 

Proof. Proof of Theorem 1 

We use a modified argument of a proof presented in Section 12.8.2. in Buhlmann and van de 
Geer (2011). Formally, we can write 

I(Yi < v t ) = h lQN {v l \X i ) + £jj, 

e it = I(Yi < v t ) -h lQN (v t \Xi) (i = l,...,N, i = l,...,n). 
The errors £j, have reasonable properties, as discussed in (13) below. 

There are two issues that need to be addressed. First, we define the inner products of functions 
h and g 

n 

(h, g) n ,N,m = rT 1 E[h(v l \X)g(v l \X)] 

and 

N n 

{h,g) ntN = n- l N- 1 Y J S ^ZKyi\X i )g{v l \X i ). 
i=i i=i 

The proof in Buhlmann and van de Geer (2011) can then be used with the scalar product 

(h,g) n ,N- 

Secondly, for controlling the probabilistic part of the proof, we need to show that the analogue 
of formula (12.26) in Buhlmann and van de Geer (2011) holds. This translates to deriving a 
bound for 

N n 

max (b ,k bj,k^ £ )n,N = max (niV) -1 bo^oi^bj^ (Xj)eu. 

i=i i=i 

Because /i 7o N is the projection of I(Y < v l ) (i = 1, . . . , n) onto the basis functions 
b j ( . (vi)bj ! k 1 (Xj)i = 1, . . . ,n) with respect to the || • || nj 7v,E- n orm (see (7)), and owing to the 
definition of e^, we have 



n 

i=i 



-^n^boM^bjMiXj)} = Vj, k , h. (13) 

Therefore, 

(nN)- 1 ^2 Yl b 0M^) h ]M{X 3 )e it = N' 1 ^ Z iU, k), h 



N n N 



i=l i=l i=l 

E[Z i {j,k o ,ki)] = 0. 
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Furthermore, owing to the boundedness assumption in (Al), \\Zi(j, ko, &i)||oo < C\ for some 
constant C% < oo \/i,j,ko,k±. Applying Hoeffding's inequality, for independent (but not 
necessarily identically distributed) random variables (van de Geer 2000, Lem. 3.5) and using 
the union bound, we obtain 

max {b 0}ko bj, kl ,£)n,N = O p ( J\og(J N K QN Ki N )/N ) . 

This, together with the proof from Section 12.8.2 in Biihlmann and van de Geer (2011) 
completes the proof of Theorem 1. □ 



Gradients 

Gradients for different loss functions p and arbitrary absolute continuous distribution func- 
tions F with density function /: 



, P-m, I £gj < _ 1 - I(Y, < v,) . . , 




U u P= ^ bc {I(Y l <v l )-(l-I(Y l <v l ))}f(h^ ] ) 
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